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Abstract 


We study the Gaussian Process regression model in the context of training data 
with noise in both input and output. The presence of two sources of noise makes 
the task of learning accurate predictive models extremely challenging. However, 
in some instances additional constraints may be available that can reduce the un¬ 
certainty in the resulting predictive models. In particular, we consider the case of 
monotonically ordered latent input, which occurs in many application domains 
that deal with temporal data. We present a novel inference and learning ap¬ 
proach based on non-parametric Gaussian variational approximation to learn the 
GP model while taking into account the new constraints. The resulting strategy 
allows one to gain access to posterior estimates of both the input and the output 
and results in improved predictive performance. We compare our proposed mod¬ 
els to state-of-the-art Noisy Input Gaussian Process (NIGP) and other competing 
approaches on synthetic and real sea-Ievel rise data. Experimental results suggest 
that the proposed approach consistently outperforms selected methods while, at 
the same time, reducing the computational costs of learning and inference. 


1 Introduction 

Uncertain or noisy data, both in input and the output, is a common problem that cannot be avoided in 
many real world applications. Neglecting this uncertainty will results in inaccurate predictive mod¬ 
els, particularly when the noise is large. Most machine learning models and settings only consider 
the output noise, and devise ways to effectively mitigate its presence. Noise in the input is considered 
less frequently, typically in the context of error-in-variable models IH The presence of input noise 
is typically more difficult to handle than the additive output noise, largely because of the nonlinear 
dependence of predictors on its input. To address this challenge traditional sampling-based Monte 
Carlo Markov chain (MCMC) techniques are often employed, however they are time-consuming 
and will not be appropriate for large datasets. Explicit integrating out of the input uncertainty, when 
the input density is known, is intractable in common situations 121 , ||3l. 

The challenge of handling noisy input becomes more daunting when other sources of prior knowl¬ 
edge of the unobserved true input are present and need to be taken into account. In many applications 
dealing with time-series data and in particular in earth sciences, it may be known that the samples 
{{Vi, ti)}, with uncertainty in both the output yi and input U, must be ordered, e.g., that the under¬ 
lying latent noise-free estimates corresponding to U, U+i. Eor instance, in measurements 
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of historical sea-level, which are often based on geological records, it is known that certain mea¬ 
surements precede others in time, although their exact ages remain unknown. The uncertainty in 
input (age) obtained from carbon ^^C-dating can often be large enough to yield high likelihood of 
miss-ordering e.g., Pr(ti > fi+i) > 1 — e, e > 0, yet requires < r^+i. Incorporating such 
ordering constraints into noisy input learning is, however, nontrivial. 

A number of approaches to dealing with input noise have been developed in the context of Gaussian 
Processes (GPs). For example, Girard and Smith in Q proposed to use a second Taylor expansion 
around the input mean to obtain a new corrected GP that accounts for the uncertain inputs. An 
alternative approach is to correct the covariance matrix in GPs under the presence of input noise 
and was introduced in jS). The corrected covariance matrix was determined by computing the ex¬ 
pectation of the covariance function with respect of the input distributions. The closed form of the 
expectation was provided in for linear, polynomial and squared exponential covariance function. 
Recently, McHutchon in Q developed a simple but effective method called noisy input GP (NIGP) 
that was showed to outperform the previous approaches. The basic idea of NIGP is to refer the input 
noise to the output noise by using a first order Taylor expansion around the noisy inputs, similar 
to traditional error-in-variable approaches. A procedure to iteratively optimize the input noise pa¬ 
rameters and GP hyper-parameters was also provided. Although NIGP was shown to perform well 
on synthetic datasets, it remains to share common limitations with related approaches. In particu¬ 
lar, incorporating prior information into the NIGP framework is challenging. Next, NIGP may not 
perform well in cases of large input noise due to its dependence on (first order) Taylor expansion. 
Finally, NIGP does not provide an immediate means to estimating the posterior density of the latent 
input, a task which is often of interest in practical applications. 

In this work we propose a new approach to learning GP models from data corrupted by dual input- 
output noise, in the setting when ordering constraints on the latent input are present. Depending on 
the quantification of ordering constraints, the task of learning the GP models, estimating the posterior 
of the latent (but ordered) input, and the posterior of the output become nontrivial. In particular, the 
densities of interest cannot be computed analytically nor do they remain in the exponential family. 
To address these challenges in Sec [3 we formulate a non-parametric variational approach based 
on recent work in [Si in the context of ordered input noisy GPs. We demonstrate how additional 
approximations can be used to yield tractable inference and learning in these models, as outlined in 
Sec. 13.II Finally, in Sec.|5]we demonstrate the utility of our approach by contrasting its performance 
to state-of-the-art models, including NIGP and a sampling-based MCMC solution. 


2 Problem formulation 


We consider the following non-linear regression model ?/i = f{Ti) + ey^i where are explana¬ 

tory variables , yi are response variables and Cy^i are zero mean Gaussian output noise variables with 
known standard deviations CTj, i. In our work, the true input variables r = {Ti}f^i are not observed 
and what we actually observe are their noisy versions. We assume a classical error-in-variable model 
here to obtain the noisy inputs: i.e U = Ti + et,i, where et,i ^ Af{et,i\0, crt,i) is an additive zero 
mean Gaussian noise that is independent from r^. 

The latent true output variables {f{Ti)}fLi, are assumed to have a GP prior with zero mean and 
a covariance function fce(T, r'). We assume in this paper the covariance function is stationary. 
The reason that we use a GP framework here is because of its flexibility due to nonparametric 
property and its ability to handle uncertain data as previous works suggested. Learning in usual 
GPs involve choosing the optimal hyper-parameters 9* by maximizing the log-marginal likelihood: 
6* = argmaxe f log (Pr(ylf)Pr( flT))df = —y^Kgy — log\Kg\ + const. Where Kg is the train¬ 
ing covariance matrix, {Kg)ij = keiji.Tj) -I- CTy — j); Vi, j = l,2,..,n; /(.) is the indicator 
function. 

The prior knowledge in our model is that the latent true inputs satisfy: ti > T 2 > ... > tjv. The 
final goal is to predict the function value /(t*) of an unseen sample r*. Without noisy input data L, 
/(r*) is Gaussian variable with mean and variance alternatively given by ||7l : 


E[f{T*)\ = k{T,T*fK-^y 


( 1 ) 
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Var[f{T*] = k{T*,T*) — k{T,T*)K ^k{T,T*) 


( 2 ) 


3 Nonparametric Gaussian variational inference model 


Here we present our method that can overcome those difficulties listed in previous sections. First, in 
order to guarantee the monotonic order of we use the following variable transformation: 

r = Tn\k = log(ri - Ti+i) Mi = 1,2, ..,n - 1 (3) 


We will model the random variables r, li instead of because we do not have the constraints any¬ 
more. For simplicity, we assume that k and r have a uniform prior, i.e Pr{li) c>c 1; Vz; Pr{r) oc 1. 
The log mariginal likelihood log Pr{y\0) can be bounded below by introducing a variational distri¬ 
bution Q{1, r) as follows: 


Pr{y,l,r\e) 


log Pr {y\0) = log J Pr{y,l,r\0)dldr^ > J J Q{l,r)log q^i 

11QV, r) log dUr = logPr(,|^) - / / <3(i,.) log 


Q{l,r) 


dldr 


dldr 

(4) 


In order to maximize the log marginal likelihood, we seek to find a variational distribution Q{l,r) 
which belongs to a tractable distribution family and minimize the KL divergence from Q to 
Pr{l, r\y, 0) . We choose Q to be a mixture of K Gaussians to capture the possible multimodality 
ofPr{l,r\y) HI 




(5) 


Vi = diag{vi); $ = {mi,uJ;Vz = 1, 2, ..,K} 


(6) 


Here we choose Vi to be an isotropic covariance matrice for optimization convenience. The set of 
variational parameters T* can be found by minimizing the KL divergence between the variational 
distribution Q and the true posterior distribution Pr{l, r\y, t, at,0). The objective function that we 
need to minimize is: 


F{<^]0) = KL[Q{l,r)\\Pr{l,r\y,t,at,0)] 

Where: 


H[Q] - Eq log Pr{t\l,r, at) - Eq log Pr{y\l;0) 

(7) 


H[Q] = - J J Q{l,r) log Q{1, r)dldr 


( 8 ) 


EQlogPr(t\l,r,at) = Eq[ — 


2ai 


2 

Kn 


EQlogPr{y\l,0) =Eq(^ - ]^y'^K{l) ^y - ^log\K{l)\ + const^ 


+ const (9) 

( 10 ) 


In the above equations, the entropy term H[Q] can be bounded above by using Jensen inequality : 

K K 

H[Q\>-Y,-Y,M{m,-mj,V, + Vj) ( 11 ) 

i=i ^ j=i 

The expected log likelihood with respect of t, Eq log Pr{t\l, r, at) can be computed analytically 
with detailed derivation included in the supplementary material.. 
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In Eq. (11) for stationary covariance functions the log marginal likelihood log Pr{y\l; 9) depends 
only on I but not r. Its expecation will be a highly nonlinear function of I due to its appearance in 
the inverse matrix K{1) and the expecation does not have a closed form. We approximate this term 
by a second Taylor expansion around the means mi,i = 1, 2,iT. 

1 ^ / 1 \ 
EQ\ogPr{y\l\e) ^ log Pr{y\m^\9) + -trace{\7f log Pr{y\l-,9)i=miVi) j (12) 

We can iteratively optimize $ and 9 and based on the final optimal values of and to 
determine an estimation of the true inputs. Then we uses the these estimated quantities for future 
prediction based on Eq. (1) & (2). One important point is that minimization of F{^', 9) requires us 
to compute the gradient and the Hessian of logPr{y\l, 9) as you can see in Eq. (11) which might 
take a lot of time. In the next section, we will present the key idea to compute these terms efficiently. 


3.1 Using chain rule for faster computation 


Procedures of computing the gradient jjjg main diagonal entries of the Hessian 

^ provided in the Appendix section. Generally we have to compute ^ and 

respectively. However, for each i to compute the gradient of the covariance matrix K with respect 
to li we have to compute ; Vj, k G {1, 2,.., N} st : j > i > k — 1 since Kjk is a function 
of exp{j) + exp{j + 1) + .. + exp{k — 1). This would take 0{N‘^) for each i and 0{N^) in total 
for all i. The same problem happens to calculation of the Hessian, when we need to figure out 


dl^ 


Vj > i > fc — 1 . 


Nevertheless, we can use the intermediate results of to compute thus reducing 

the total computation time as follows. First, note that = 0 for h ^ {j, fcjand = 0 for h — 

1 > = e** for h < i. Second according to the chain rule: 


dKjk _ 9Kjk dxh _ dKjk dr^ _ ( dKj^ 
dli ^ dTh dli ^ dTh dli dTh dn 

h—1 h—1 h—1 

Thus, ^ . Since computing of ^ takes 0{N) time so does overall 

determined in 0{N'^) time. 

The same trick can be applied to compute we have: 


d'^Kjk f dKjkdrl dTh^2\ , f dK'^k dThdTh'\ 


Tfc, k{Tj\Tk) = k{Tj + A;Tk + A),' 

on this observation to compute ^j^we only need to compute for h > i at i-step ano it i 

0{N) time. Hence the overall complexity for calculating Vz is 0{N^). 


*We omit the dependence of the loglikelihood on the hyperparameter 0 for brevity of exposition here. 
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4 Baseline method 


In order to demonstrate the benefits of our proposed model, we will compare our model with MCMC, 
NIGP and GP in experiments. The detail of MCMC-based method for handling noisy inputs with 
ordering constraints is given below. 

4.1 Monte Carlo Markov Chain sampling 

We can use MCMC for drawing samples of r and 9 from the joint posterior distribution 
Pr{T\9\y,t,at) c>c Pr{y\T\9)Pr{t\T,at)Pr{T)Pr{9). For simplicity, we assume both r and 0 
have a uniform prior distribution. 

We follow Metropolis-Hasting sampling stragey here, when a Gaussian distribution will be used as 
the proposal distribution. In particular, at (kH-l)-th iteration for each i G {1, 2,.., n} we generate a 
new potential true inputs rf using a Gaussian function ; rf) centered at rf. Since we have 

ordering constraints over the true inputs, i.e > r^+i so we propose to discard this sample 

if it violate the constraints and continue to the next index of inputs, z + 1. Otherwise, we will accept 
this sample with the probability 



Where S denotes the set of currently stored samples for r, card{S) = N. Note that in order to 
compute the acceptance probability, we need to compute the likelihood of y given the currently 
stored samples without one element, and this costs 0{N^) operations for inverting the covariance 
matrix. We can reduce the time complexity at each step as follows. Note that when we iterate each 
location z of r in turn, the covariance matrix will change only on i-th row and i-th column. Based 
on this observation we can reduce the running time of MCMC by applying the following Woodbury 
matrix inversion lemma: 


K ^uv^K ^ 
1 + K~^u 


{K + zzu^) ^ = K ^ 


(16) 


This reduces the complexity of updates from 0{N^) to 0{N‘^). 

5 Experiments 

In this section we first demonstrate the effectiveness of our proposed method on artificial datasets 
since there is no available real datasets with groundtruth. Then we run the proposed model with 
MCMC, NIGP and GP to see if there is any difference on Northeastern Florida dataset where no 
groundtruth exists. 

5.1 Evaluation on synthetic datasets 

Here we consider the set of experiments that were suggested in with the added ordering con¬ 
straints on the latent input variables r. In particular, there are 25 input points that are equally spaced 
in [—10; 10] and we are trying to learn functions that have varying gradients across the input space. 
The range of all functions are the same, 10. The reason for doing this is we want to see behavior 
of methods in two cases: small and large output noise. For the output noise, we consider a small 
CTy = 0.05 and a large ay = 1 noise setting. The input noise will vary in the range at G [0.2,3]. 

We employ a Matern covariance with here because it was used in previous works, eg [8] for sea level 
modelling 



; 9 = af, d 


We will compare four methods: our proposed method, MCMC, NIGP and usual GP. Experiments 
settings for each method are as follows: 
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Input Noise Level (Std) 

(d) 0.055r^ tanh(cos(r)) 










Figure 1; (a)-(e) Comparison among four methods: our proposed approach, MCMC, NIGP and 
GP on hve latent functions based on prediction error, (f)- (j) Comparison between our work and 
MCMC in term of the ability to estimate the true inputs. The output noise is a small constant value: 
(Ty = 0.05, while the input noise changes from 0.2 to 3. The baseline error is the mean absolute 
difference between noisy inputs t and t. 


For our proposed model we used a mixture of Gaussians with K = S for the variational distribu¬ 
tion Q{l,r) and scaled conjugate gradient to minimize F'($) in Eq.(12). To avoid local minima, 
we run our model hve times each time with different inital points of parameters and choose the 
model that return the smallest objective function value. We set a default value of 5000 iterations for 
MCMC since based on our experiments, this value is large enough for convergence to the stationary 
distribution. For NIGP we used the Matlab’s global optimization toolbox to learn the GP hyper¬ 
parameters. The average training time (s) after hve single runs for our proposed method, MCMC, 
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NIGP and GP alternatively are 19(±1.2), 61(±5.5), 106(±4.6), 0.3(±0.008). This confirms that 
our proposed method is computationally significantly more efficient than both NIGP and MCMC. 

Next, we compare four methods based on prediction errors in the case of small output noise. The 
criateria for comparison is the root mean squared error (RMSE). For each case of input noise level, 
we run each method five times, then take the average prediction errors with the standard deviations. 
Fig (1). (a)-(e) indicates the peformance of four methods for each function. 

Based on those figures. Our proposed method demonstrates predictive performance on par with 
MCMC, while being significantly more computationally efficient. The method consistently outper¬ 
forms GP and NIGP that do not utilize the ordering constraints, and hence are unable to effectively 
deal with the input noise. To test the ability of different methods to recover the true input, we exam¬ 
ined the input estimation errors for MCMC and our proposed approach. This is depicted in Fig.(l) 
f-j. Note that neither GP nor NIGP explicitly seek to recover the input estimates. As the evaluation 
criteria, we use the mean absolute error(MAF) and contrast it with the amount of noise in the in¬ 
put. Experimental results indicate that both MCMC and our proposed approach effectively reduce 
the amount of noise in the input, which, in turn, enables more accurate function prediction. Again, 
our proposed approach accomplishes this task in a computationally more efficient manner than the 
competing MCMC 

However, when the output noise is large, ay = 1, the improvement of our method over NIGP and GP 
is not much. Only three out of five cases of selected functions, our work performs clearly better than 
GP with different input noise levels. In the last case of function/(r) = 1.761og(T^ sin(2r-|-2)-|-l), 
NIGP even has smaller prediction errors than our work. We provide experimental results of this case 
in supplementary material. 


5.2 Application to sea level estimation 

We demonstrate the application of the proposed model on the reconstruction of sea level in Noth- 
eastern Florida from 700 BC to 2010 AD. We used the dataset that was provided in supplenemtary 
data described in ||9l, consisting of 77 data points, ranging from 560 BC to 2010 AD. Among them 
65 instances have noisy inputs. The standard deviation of output noise at the 65 noisy input measure¬ 
ments is constant and very large, ay = 181 ~ rnax(y)-min{y) ^ output noise at the remaining 
12 instances is smaller. We used the same settings of all methods as the previous experiments with 
synthetic datsets. 

Prediction results for the four methods are displayed in Fig. (2). A more insightful look can be 
gained by considering at the differences in the predictions of the four methods. In Tab. (1) we 
show the average absolute pairwise differences between mean predictions of different approaches, 
together with the average symmetrized KF divergence of predictive densities on query points. All 
differences are statistically significant at 5% level, indicating that different methods make could lead 
to alternative explanations of the sea-level rise history as well as result in different predictive models. 


Table 1: Mean absolute error and symmetrized KF divergence for measuring the 
difference between predictions of four models 


(a) Mean absolute differences between mean 
predictions of any pair of four methods 

NVP MCMC NIGP GP 


NVP - 1.23 3.49 3.42 

MCMC - - 2.80 2.73 

NIGP - - - 0.069 

GP - - - - 


(b) Symmetrized KL divergence between predictive 
posterior distribution of any pair from four methods 

NVP MCMC NIGP GP 


NVP - 0.71 3.2 3.02 

MCMC - - 0.98 0.88 

NIGP - - - 0.003 

GP - - - - 


Our model could be used to predict the sea level rise rate in the future. These outcomes are of 
particular concerns in the context of climate science research and suggesting possible reactions to 
the threat of the sea level rise. However, this should be done with climate-driven data which we do 
not have for now and we will leave this for our future work. 
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(a) Sea level reconstruction by NPV (b) Sea level rise reconstruction by MCMC 




(c) Sea level reconstruction by NIGP (d) Sea level reconstruction by standard GP 

Figure 2: Reconstruction of Northeastern Florida sea level using four methods. The noisy obser¬ 
vations {t, y) with input and output noise level were marked by a blue curve and the error bars in 
both directions indicate the input and output noise level. The mean predictions with ± one standard 
deviation for each method were also plotted. It seems that there is little diffence among methods.. 


6 Conclusions and discussion 

In this paper, we have introduced an efficient and effective GP-based method to handle noisy inputs 
when we knew the order of the latent true inputs variables. We transformed those latent variables to 
obtain unconstrained ones and a Bayesian treatment was applied to infer the posterior distribution 
of the new variables. The experiments indicated the improvement of our model over NIGP in term 
of running time and prediction error. Our model assumes that the input and output noise level 
are known because these quanties are given in sea level domains. In other practices, however we 
can consider them as parameters and modify the objective function to optimize them. Our future 
research work involves applying the proposed model to reconstruct global sea level. In this case, 
besides temporal inputs we have spatial information, the location of sea level, the problem becomes 
more complicated. 


References 

[1] P. Dellaportas and D.A. Stephens. Bayesian analysis of errors-in-variables regression models. Biometrics, 
51:1085-95, 1995. 

[2] Andrew Mchutchon and Carl E. Rasmussen. Gaussian process training with input noise, pages 1341-1349, 
2011 . 

[3] Michalis Titsias and Neil Lawrence. Bayesian gaussian process latent variable model. 2010. 

[4] Agathe Girard and Roderick Murray-Smith. Learning a gaussian process model with uncertain inputs. 
2003. 


8 















[5] Patrick Dallaire, Camille Besse, and Brahim Chaib-draa. Learning gaussian process models from uncertain 
data. In Chi-Sing Leung, Minho Lee, and Jonathan Hoyin Chan, editors, ICONIP (1), volume 5863 of 
Lecture Notes in Computer Science, pages 433^40. Springer, 2009. 

[6] Agathe Girard, Carl Edward Rasmussen, Joaquin Quinonero-Candela, and Roderick Murray-Smith. Gaus¬ 
sian process priors with uncertain inputs, application to multiple-step ahead time series forecasting. 2003. 

[7] Carl Edward Rasmussen and Christopher K. 1. Williams. Gaussian Processes for Machine Learning (Adap¬ 
tive Computation and Machine Learning). The MIT Press, 2005. 

[8] Samuel Gershman, Matthew D. Hoffman, and David M. Blei. Nonparametric variational inference. CoRR, 
abs/1206.4665, 2012. 

[9] Andrew C Kemp, Christopher E Bernhardt, Benjamin P Horton, Robert E Kopp, Christopher H Vane, 
W Richard Peltier, Andrea D Hawkes, Jeffrey P Donnelly, Andrew C Parnell, and Niamh Cahill. Late 
holocene sea-and land-level change on the us southeastern atlantic coast. Marine Geology, 357:90-100, 
2014. 


9 



